
Batch main sumcheck across chips #1333

Open
hero78119 wants to merge 51 commits into master from feat/batch_main_sumcheck

Conversation

@hero78119 (Collaborator) commented Apr 29, 2026

Problem

Main sumcheck was proved and verified per chip, which duplicated transcript work, selector/claim handling, and PCS opening plumbing across chips.

Design Rationale

Use one global batched main sumcheck proof while keeping PCS openings in the existing suffix path. The verifier mirrors the prover's transcript order, sampling the ECC bridge challenge before the global combine subset-evals challenge, and evaluates the frontloaded expressions itself.
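The batching idea can be sketched abstractly: prover and verifier derive one challenge from the shared transcript and fold all per-chip claims into a single global claim via a random linear combination, so only one sumcheck is run. A minimal sketch over a toy prime field (the name `combine_claims` and the field arithmetic are illustrative, not the actual ceno_zkvm API):

```rust
// Fold per-chip sumcheck claims into one batched claim:
//   batched = sum_i alpha^i * claim_i  (mod p)
// Both sides derive the same `alpha` from the shared transcript, so they
// agree on the single batched claim without per-chip transcript work.
fn combine_claims(claims: &[u64], alpha: u64, modulus: u64) -> u64 {
    let mut acc = 0u64;
    let mut pow = 1u64; // alpha^i
    for &c in claims {
        acc = (acc + pow * c % modulus) % modulus;
        pow = pow * alpha % modulus;
    }
    acc
}

fn main() {
    // Three toy per-chip claims, challenge alpha = 11, field modulus p = 97:
    // 3 + 11*5 + 11^2*7 mod 97 = 32
    assert_eq!(combine_claims(&[3, 5, 7], 11, 97), 32);
    println!("batched claim = {}", combine_claims(&[3, 5, 7], 11, 97));
}
```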

Change Highlights

  • ceno_zkvm: batches main constraints into a single global proof path across chip proofs.
  • ceno_zkvm: keeps witness/fixed PCS openings per chip after global main verification.
  • ceno_recursion: mirrors native verifier changes for the batched main proof.
  • ceno-gpu: supports the batched main proving flow.

Benchmark / Performance Impact

The benchmark session compares the frontload baseline against successive feat/batch_main_sumcheck optimization runs on block 23817600, with GPU proving and CENO_GPU_ENABLE_WITGEN=0.

Comparison convention: lower time is better. Signed x values use -Nx for slower-than-baseline wall time and +Nx for faster/lower-time metrics; for example, taking twice as long is -2.00x.
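A small helper makes the convention concrete (hypothetical illustration, not part of the benchmark tooling):

```rust
// Signed ratio convention used in the tables below: lower time is better,
// so a run slower than baseline is reported as -Nx and a faster run as +Nx.
fn signed_ratio(baseline_s: f64, current_s: f64) -> f64 {
    if current_s > baseline_s {
        -(current_s / baseline_s) // regression: e.g. 75.6s -> 91.8s is about -1.21x
    } else {
        baseline_s / current_s // improvement / lower-time metric
    }
}

fn main() {
    // E2E total from the tables: baseline 75.600s vs latest 91.800s
    assert!((signed_ratio(75.6, 91.8) + 91.8 / 75.6).abs() < 1e-12);
    println!("{:.2}x", signed_ratio(75.6, 91.8)); // prints -1.21x
}
```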

Timeline / Optimization Progress

| Date | Run / Job | Ceno / GPU Commit | E2E | vs Baseline | app_prove | vs Baseline | prove_batched_main_constraints | Short Highlight |
|---|---|---|---|---|---|---|---|---|
| May 6 | 25419833788 / job 74559223217 | Ceno 7a07649b, GPU 1118dca8 | 75.600s | Baseline | 61.000s | Baseline | 0.000s | Baseline: frontload, per-chip main constraints |
| May 9 AM | 25594090744 / job 75136918384 | Ceno dd229c00, GPU 340651b4 | 103.000s | -1.36x | 87.400s | -1.43x | 0.000s | Batched branch after alpha.28 upgrade; tower/extract totals much lower but wall time regressed |
| May 9 PM | 25603601935 / job 75161599043 | Ceno d5ae1b3a, GPU fbef26f3 | 104.000s | -1.38x | 88.300s | -1.45x | 26.925s | Batched main proof enabled; new batched-main critical path dominates |
| May 11 | 25655529702 / job 75302942526 | Ceno c2c45cc9, GPU 3dedbc78 | 91.800s | -1.21x | 76.500s | -1.25x | 15.457s | Latest optimization: direct batched-main construction + bucketed fold/eval GPU sumcheck |

E2E / Layer

| Metric | Baseline | Latest Optimization | Comparison |
|---|---|---|---|
| E2E total | 75.600s | 91.800s | -1.21x |
| emulator | 10.100s | 10.200s | -1.01x |
| app_prove wall time | 61.000s | 76.500s | -1.25x |

App Prove Breakdown

Profiler module totals can overlap because chip proving is concurrent; use app_prove wall time above for critical-path impact. The latest run materially reduces the new batched-main cost, but total wall time is still slower than the frontload baseline.

| Operation | Baseline | Batched May 9 AM | Batched May 9 PM | Latest May 11 | Latest vs Baseline |
|---|---|---|---|---|---|
| prove_batched_main_constraints | 0.000s | 0.000s | 26.925s | 15.457s | New cost |
| prove_main_constraints | 22.622s | 0.000s | 0.000s | 0.000s | Removed |
| extract_witness_mles | 24.155s | 3.760s | 3.713s | 3.739s | +6.46x |
| build_tower_witness_gpu | 3.491s | 0.323s | 0.316s | 0.323s | +10.81x |
| prove_tower_relation_gpu | 176.090s | 24.008s | 24.417s | 24.857s | +7.08x |
| pcs_opening | 15.246s | 15.207s | 15.164s | 15.175s | +1.00x |
| commit_traces | 6.827s | 6.814s | 6.851s | 6.857s | -1.00x |
| parsed rows total | 251.118s | 50.995s | 78.287s | 67.460s | +3.72x |
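The overlap caveat can be demonstrated with a toy concurrency sketch (not ceno code): when chip proofs run concurrently and each logs its own elapsed time, the summed module total exceeds the wall time of the critical path.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Toy model of concurrent chip proving: each "chip" sleeps ~100ms and logs
// its own elapsed time. The parsed module total is the sum of those logs
// (>= 200ms), while on a multicore machine the wall time stays near 100ms,
// so module totals are not a wall-time decomposition.
fn simulate_concurrent_proving(n_chips: usize, each_ms: u64) -> (Duration, Duration) {
    let start = Instant::now();
    let handles: Vec<_> = (0..n_chips)
        .map(|_| {
            thread::spawn(move || {
                let t = Instant::now();
                thread::sleep(Duration::from_millis(each_ms)); // simulated chip proof
                t.elapsed()
            })
        })
        .collect();
    let module_total: Duration = handles.into_iter().map(|h| h.join().unwrap()).sum();
    let wall = start.elapsed();
    (module_total, wall)
}

fn main() {
    let (module_total, wall) = simulate_concurrent_proving(2, 100);
    // The summed per-chip logs are guaranteed to be at least 200ms.
    assert!(module_total >= Duration::from_millis(200));
    println!("module_total={:?} wall={:?}", module_total, wall);
}
```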

Latest Improvement Against Previous Batched Run

| Metric | May 9 PM Batched Main | May 11 Latest | Improvement |
|---|---|---|---|
| E2E total | 104.000s | 91.800s | +1.13x |
| app_prove wall time | 88.300s | 76.500s | +1.15x |
| prove_batched_main_constraints | 26.925s | 15.457s | +1.74x |
| parsed rows total | 78.287s | 67.460s | +1.16x |

Benchmark command:

```shell
CENO_GPU_ENABLE_WITGEN=0 CENO_CONCURRENT_CHIP_PROVING=1 CENO_GPU_CACHE_LEVEL=0 \
RUSTFLAGS="-C target-feature=+avx2" \
cargo run --features "jemalloc,gpu" --release --bin ceno-reth-benchmark-bin -- \
  --mode prove-app --block-number 23817600 --rpc-url <redacted> \
  --output-dir output --cache-dir rpc-cache
```

Environment:

Summary: latest optimization improves prove_batched_main_constraints by +1.74x against the previous batched-main run (26.925s -> 15.457s) and improves E2E by +1.13x (104.000s -> 91.800s). It remains slower than the frontload baseline (75.600s -> 91.800s, -1.21x), with the remaining gap concentrated in the new batched-main critical path.

Testing

```shell
RUST_MIN_STACK=33554432 cargo check --package ceno_recursion --bin e2e_aggregate
RUST_MIN_STACK=33554432 cargo run --release --package ceno_recursion --bin e2e_aggregate -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

Also passed the linked GPU e2e benchmark run above.

Risks and Rollout

  • Soundness risk is concentrated in transcript ordering and verifier frontload evaluation; native and recursion verifiers now follow the same global proof flow.
  • Performance is not yet an E2E win in the linked benchmark despite removing per-chip main-constraint cost; further scheduling/host-overlap work is needed before rollout as a performance improvement.

Follow-ups

  • Investigate reducing the new prove_batched_main_constraints critical-path cost.
  • Keep benchmark summaries explicit that parsed module totals overlap and are not a wall-time decomposition.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

@hero78119 hero78119 marked this pull request as draft April 29, 2026 13:52
Base automatically changed from feat/prover_mle_zero_padding to master May 4, 2026 07:55
@hero78119 hero78119 changed the title batch main sumcheck Batch main sumcheck across chips May 9, 2026
@hero78119 hero78119 marked this pull request as ready for review May 9, 2026 12:22
hero78119 added 2 commits May 9, 2026 21:06
Build batched main sumcheck virtual polynomials directly from monomial terms instead of reconstructing a large Expression tree and monomializing it again. This removes expensive expression rebuild work on CPU proof generation while preserving proof semantics. Also extend integration timeout to allow the existing slow batched proof path to complete after increasing stack size.
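The direct monomial construction described in this commit can be sketched roughly as follows (the types are hypothetical, not the actual ceno_zkvm structures): keep the batched virtual polynomial as a flat list of monomial terms and evaluate it directly, rather than rebuilding a large Expression tree and monomializing it a second time.

```rust
// Hypothetical flat monomial representation of a batched virtual polynomial.
#[derive(Debug)]
struct Monomial {
    coeff: i64,
    var_indices: Vec<usize>, // indices into the shared evaluation point
}

// Evaluate sum_t coeff_t * prod_j point[var_indices_t[j]] directly from the
// monomial list, with no intermediate expression-tree rebuild.
fn eval_monomials(terms: &[Monomial], point: &[i64]) -> i64 {
    terms
        .iter()
        .map(|t| t.coeff * t.var_indices.iter().map(|&i| point[i]).product::<i64>())
        .sum()
}

fn main() {
    // 2*x0*x1 + 3*x2 at (1, 2, 3) = 4 + 9 = 13
    let terms = [
        Monomial { coeff: 2, var_indices: vec![0, 1] },
        Monomial { coeff: 3, var_indices: vec![2] },
    ];
    assert_eq!(eval_monomials(&terms, &[1, 2, 3]), 13);
}
```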